-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Support groupby operations for decimal dtypes #7731
Conversation
…nt of aggregators.
@vyasr None of the aggregations mentioned in 1) or 2) are currently supported for This is the main issue for all of |
Thanks @codereport! That's helpful. I've updated the list of supported operations in this PR. |
804e623
to
286b686
Compare
After tracing this pretty deep down the stack @codereport helped me identify that these operations are actually not supported on cuda 10.x due to a compiler bug. The C++ tests for this already exist and are explicitly disabled. In the interest of testing this to the extent possible, I've adding a conditional skip with pytest based on the |
Instead of checking the nvcc version we should be able to use the |
@gpucibot merge |
This PR makes some improvements to the groupby/aggregation code that I identified while working on #7731. The main purpose is to make the code logic easier to follow and reduce some unnecessary complexity; I see minor but measurable performance improvements (2-5% for small datasets) as well, but those are mostly just side effects here. Specifically, it makes the following changes: 1. Inlines the logic for dropping unsupported aggregations. The old function was repetitive and necessitated looping over the aggregations twice, whereas the new approach drops unwanted aggregations on the fly so it only loops once. The new code also makes it so that you only construct a C aggregation object once. 2. Merges the logic from `_AggregationFactory` into `Aggregation`, and removes the constructor for `Aggregation`. The one downside here is that the Cython `Aggregation` object's constructor no longer places it in a valid state; however, in practice the object is always constructed via either the `make_aggregation` function or its various factories, and the object's constructor was only every used in `_drop_unsupported_aggs` anyway. The benefit is we remove the fragmentation between these two classes, making the code much more readable, and the `Aggregation` class actually serves a purpose now beyond just providing a single property `kind` that is only used once: it is now the primary way that other Cython files interact with aggregations. This also means that in most places other Cython modules don't need to work with `unique_ptr[aggregation]` as much anymore (although they do still have to move `Aggregation.c_obj` for performance reasons). `make_aggregation` now returns the Cython class instead of the underlying C++ one. 3. Modified all the "allowed aggregations" sets to use the uppercase names of the aggregations. In addition to simplifying the code a tiny bit, this helps reduce confusion between the aggregation names used in Python for pandas compatibility and the libcudf names (for instance, `idxmin` vs `argmin`, now `ARGMIN`). 4. Explicitly defines all the aggregations on a groupby. I discussed this briefly with @shwina, the change has pros and cons. The benefit is that all of these methods are properly documented now, there's less magic (the binding of methods to a class after its definition can be confusing for less experienced Python developers and has a lot of potential gotchas), and we can use the simpler string-based agg definition wherever possible. The downside is that we now have to define all of these methods. I think the change is definitely an improvement, but I'm happy to change it back if anyone can suggest a better alternative. In the long run we probably need to find a way to share both code and docstrings more effectively between all aggregations (DataFrame, Series, and GroupBy). Authors: - Vyas Ramasubramani (https://github.com/vyasr) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Karthikeyan (https://github.com/karthikeyann) - Ashwin Srinath (https://github.com/shwina) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #7818
This PR resolves #7687. It also does a bit of cleanup of the internals of the code base. There is more that I would like to do, but I'll probably punt everything to a future PR that I don't directly have to touch for this change in the interest of quickly resolving the issue.
I still need help determining why a few aggregations aren't working. The problems fall into two groups:
var
andstd
aggregations currently don't fail, but they always return columns filled with NULLs. I found the implementation of the dispatch for these methods invariance.cu
/compound.cuh
, and at least nominally it seems like these methods are not currently supported because the correspondingenable_if_t
is based on whether the type satisfiesstd::is_arithmetic
, which decimal types will not. However, I'm not sure whether the problem is that this classification is incorrect and these types are actually supported bylibcudf
, or if there really isn't an implementation; I tried to find one, but there are a lot of different files related to aggregation and I'm sure I didn't find all of them. If we simply don't have an implementation, I can remove these from the list of valid aggregations.mean
,quantile
, andmedian
aggregations all raise aRuntimeError
frombinaryop.hpp
: "Input must have fixed_point data_type." I've traced the error to the CythonGroupBy.aggregate
method, specifically the line where it calls through to the underlyingc_obj
'saggregate
method. The call stack in C++ is pretty deep after that, though, and I haven't yet been able to pinpoint whether the failure is a missing cast somewhere (i.e.libcudf
thinks that the column is a floating point type when it's really not) or if the problem lies elsewhere.Update
Thanks to @codereport, I've now marked all the above as unsupported operations. After some discussion with other devs I've also handled the other extended types. I still need to write tests, but I think this PR is ready for review in its current form to identify if I've missed anything in the implementation.